Tidal Volume Level Estimation Using Respiratory Sounds

Respiratory sounds have been used as a noninvasive and convenient means of estimating respiratory flow and tidal volume. However, current methods need calibration, which makes them difficult to use in a home environment. A respiratory sound analysis method is proposed to estimate tidal volume levels during sleep qualitatively. Respiratory sounds are filtered and segmented into one-minute clips, and all clips are clustered into three categories (normal breathing, snoring, and uncertain) with agglomerative hierarchical clustering (AHC). Formant parameters are extracted to classify snoring clips into simple snoring and obstructive snoring with the K-means algorithm. For simple snoring clips, the tidal volume level is calculated from the snoring duration. For obstructive snoring clips, the tidal volume level is calculated from the maximum breathing pause interval. The performance of the proposed method is evaluated on an open dataset, PSG-Audio, in which full-night polysomnography (PSG) and tracheal sound were recorded simultaneously. The calculated tidal volume levels are compared with the corresponding lowest nocturnal oxygen saturation (LoO2) data. Experiments show that the proposed method calculates tidal volume levels with high accuracy and robustness.


Introduction
Sleep quality and sleep time are both important for human health. Sleep quality measures how restful and restorative sleep is. Enough hours of sleep do not necessarily guarantee the most restful type of sleep. More than 80 sleep disorders are known to affect sleep quality. Among the factors that cause poor sleep quality, sleep-related breathing disorders (SRBD) are the second most common of all sleep-related disorders (the first is insomnia) [1]. SRBD is a condition of abnormal and difficult respiration during sleep, which affects the balance of oxygen and carbon dioxide in the blood. Tidal volume is one of the parameters for monitoring respiratory ventilation and pulmonary function. Tidal volume is the amount of air that moves in or out of the lungs with each respiratory cycle. The normal tidal volume is around 500 mL in an average healthy adult male and approximately 400 mL in a healthy female. The tidal volume during sleep can be measured by many methods, such as polysomnography (PSG) and inductance plethysmography [2]. However, these methods are expensive, require specialized operation, and make sleeping uncomfortable. Therefore, there is a need for a nonintrusive, easy-to-operate method that can be used in a home environment. The acoustic method is becoming popular in respiration monitoring because it only involves acquiring and processing respiratory sound signals to estimate tidal volume. The development of smartphones and wearable devices has also made it possible to monitor respiration and tidal volume during sleep. Monitoring respiratory quality using respiratory sound has become an active research topic in recent years.
Many researchers have focused on analyzing the correlation between respiratory sound and respiratory airflow because of its potential for assessing snoring risk and estimating tidal volume. Various models and algorithms have been proposed to estimate respiratory flow from respiratory sounds. Gavriely and Cugell proposed that the breath-sound amplitude (BSA) and flow (F) generally follow a 1.75-power relationship [3]. Yap and Moussavi proposed a method that uses average power and an exponential model to estimate respiratory flow from tracheal sound, which reached an estimation error of 5.8 ± 3.0% [4]. Reljin et al. used the blanket fractal dimension (BFD) as the parameter for estimating the tidal volume from tracheal sounds recorded by an Android smartphone; the smallest normalized root-mean-squared error, 15.877% ± 9.246%, was obtained with the BFD and an exponential model [5]. Yadollahi and Moussavi extracted the average power, the logarithm of the variance, and the logarithm of the envelope of tracheal sound as features and compared their ability to fit the flow-sound relationship, suggesting that the logarithm of the variance is the best feature for describing the flow-sound relationship with a linear model [6]. Other studies indicated that the Shannon entropy and the sound variance also have an exponential relationship with the respiratory flow [7,8].
Most of these papers indicate that the flow rate and the respiratory sound amplitude follow a power law. The relationship used to estimate the respiratory flow rate can be written as

$$F_{\mathrm{est}} = C_1 E^{C_2}, \tag{1}$$

where $F_{\mathrm{est}}$ is the estimated flow rate (L/min), $E$ is the respiratory sound amplitude, and $C_1$ and $C_2$ are the coefficients. $C_1$ and $C_2$ are determined by the structure of the human upper airway and can be calculated for each participant from a few breaths with a known flow rate; this procedure is called calibration. Current methods require calibration to determine the model coefficients $C_1$ and $C_2$. Yadollahi and Moussavi found that the parameters of the flow-sound relationship during sleep and wakefulness are different [9]. Therefore, for monitoring the tidal volume during sleep, the model parameters should be calibrated with sleep respiratory sounds.
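As an illustration of what calibration involves, the sketch below fits the two coefficients of a power law of the form F_est = C1 * E^C2 by ordinary least squares in log-log space. The function names and the fitting procedure are assumptions made for this example; the cited studies may fit their models differently.

```python
import math


def calibrate_power_law(amplitudes, flows):
    """Fit flows ~ C1 * amplitudes**C2 by least squares on the log-log line.

    Taking logarithms gives log(F) = log(C1) + C2 * log(E), a straight line
    whose slope is C2 and whose intercept is log(C1).
    """
    xs = [math.log(e) for e in amplitudes]
    ys = [math.log(f) for f in flows]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c2 = sxy / sxx
    c1 = math.exp(mean_y - c2 * mean_x)
    return c1, c2


def estimate_flow(amplitude, c1, c2):
    """Apply the calibrated power law to a new sound amplitude."""
    return c1 * amplitude ** c2
```

The need for a handful of breaths with known flow per participant (the `flows` argument here) is exactly the step that the proposed method avoids.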
However, the methods mentioned above have only been applied to normal respiration, and calibration is needed for each case. Furthermore, these methods have not worked well for respiration during snoring. During snoring, the sound amplitude is higher than in normal breathing, whereas the respiratory airflow is lower. The main reason is that the upper airway is usually collapsed or obstructed and is highly variable during snoring. Respiration monitoring during snoring is important because snoring greatly affects sleep quality. During snoring, the upper airway is partially or completely blocked, and the respiratory airflow is limited or vanishes. Snoring usually leads to intermittent hypoxemia (IH), hypercapnia, arousal, hypertension, and sleep fragmentation. In this paper, a qualitative tidal volume estimation method based on the respiratory sound signal is proposed. It uses only respiratory sound for analysis and does not need calibration. Therefore, the respiratory sound data can be collected easily with recording equipment, and the method can be used in a home environment.
The proposed method consists of four main steps. First, the respiratory sounds are preprocessed into clips. Second, all clips are clustered into the normal breathing, snoring, and uncertain categories with agglomerative hierarchical clustering (AHC). Third, the snoring clips are classified into simple snoring and apneic snoring with the K-means algorithm based on formant parameters and time-domain parameters. Finally, the maximum breathing pause interval (MBPI) is calculated for apneic snoring clips to set the tidal volume to a medium or low level, and the snoring duration is calculated for simple snoring clips to set the tidal volume to a high or medium level. All the predictions are compared with LoO2 (lowest nocturnal oxygen saturation) to evaluate the performance. All steps are unsupervised and do not need any calibration. The flow of the proposed method is shown in Figure 1.

Materials and Methods
The tracheal sounds are extracted from the PSG-Audio dataset. The dataset comprises 212 polysomnograms along with synchronized tracheal sound. It contains edf files comprising the polysomnogram signals and rml files containing all annotations by the medical team [10]. The edf files contain 20 channels; the SpO2 (blood oxygen saturation level, channel 15) and tracheal sound (channel 19) data are extracted from the edf files for analysis. The SpO2 measures the amount of oxygen in the blood. The corresponding respiratory events (obstructive apnea, mixed apnea, and hypopnea) are extracted from the rml files. The sampling frequencies of SpO2 and tracheal sound are 1 Hz and 48000 Hz, respectively. A five-minute data clip is shown in Figure 2.

Agglomerative Hierarchical Clustering
2.1.1. Preprocessing. The first step of preprocessing is filtering and denoising. As the respiratory sound energy of healthy people is usually concentrated in the low-frequency range of [50, 2500] Hz, a 50-2500 Hz Butterworth bandpass filter is used to remove noise. The recordings are then downsampled to 5000 Hz. The second step of preprocessing is segmentation. The clip length is chosen by considering both the micro and the macro aspects. A clip should be short enough to separate each breathing stage, so that the audio signal within one clip is stable. A clip should preferably contain 5 to 10 breath periods for analysis. The usual breath period during sleep is 3 to 6 seconds, so a length of 30 to 60 seconds is reasonable. Furthermore, in severe cases an apnea usually lasts more than 30 seconds. In this paper, the segmentation length is set to 60 seconds.
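The segmentation step can be sketched as follows. This minimal example assumes the signal has already been band-pass filtered and downsampled to 5000 Hz, so that each 60-second clip holds 300000 samples; dropping a trailing partial clip is an assumption of this sketch, and the function name is hypothetical.

```python
def segment_into_clips(samples, sample_rate_hz=5000, clip_seconds=60):
    """Split a 1-D sequence of audio samples into fixed-length clips.

    A trailing remainder shorter than one clip is discarded (assumption),
    so every returned clip has exactly sample_rate_hz * clip_seconds samples.
    """
    clip_len = sample_rate_hz * clip_seconds
    n_full = len(samples) // clip_len
    return [samples[i * clip_len:(i + 1) * clip_len] for i in range(n_full)]
```

With these defaults, a two-hour recording yields 120 clips.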

Feature Extraction.
According to research on the human hearing mechanism, the human ear has different sensitivity to sound waves of different frequencies. The human ear has a higher resolution for low-frequency sounds than for high-frequency sounds. The Mel scale is a mapping between the frequency perceived by the human auditory system and the actual frequency of the sound. By converting frequencies to the Mel scale, features can better match human auditory perception [6]. The Mel scale describes the nonlinear frequency characteristics of the human ear, and its relationship with frequency can be approximated by

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right), \tag{2}$$

where $f$ is the frequency in Hertz. The Mel-frequency cepstral coefficients (MFCC) are cepstral parameters extracted in the Mel-scale frequency domain [11]. MFCC were extracted from each clip file as the feature. The MFCC extraction algorithm usually consists of windowing the signal into frames and applying the fast Fourier transform (FFT) to the frames to obtain the short-time Fourier transform (STFT) spectrum. The STFT spectrum is then filtered with Mel filter banks to obtain the Mel-spectrum, and the Mel-spectrum is transformed into the Mel-frequency cepstrum by taking the logarithm and applying the discrete cosine transform (DCT), which yields the MFCC coefficients. The MFCC feature vector describes the power spectral envelope of a single frame. Figure 3 shows the waveform, the Mel-spectrum, and the MFCC of a snoring sound clip with a duration of 60 seconds.
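The Mel mapping can be illustrated with a short sketch. It uses the common form Mel(f) = 2595 log10(1 + f/700) and its inverse, and shows how filter centers equally spaced on the Mel scale become progressively wider apart in Hertz. The helper names, and the choice of 50-2500 Hz as the filter-bank range in the usage note, are assumptions for this example.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hertz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_centers(n_filters, f_min_hz, f_max_hz):
    """Center frequencies (Hz) of n_filters bands equally spaced in Mel."""
    lo = hz_to_mel(f_min_hz)
    hi = hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]
```

Calling `mel_filter_centers(40, 50, 2500)` gives 40 centers that are densely packed at low frequencies and sparse at high frequencies, which mirrors the higher low-frequency resolution of the ear.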

Similarity Calculation.
The MFCC of each clip is a two-dimensional matrix in which each column represents a frame and each row holds one Mel-frequency cepstral coefficient across the frames. As the respiratory sound signal is quasiperiodic, the MFCC matrix can be averaged along each row to obtain a one-dimensional vector. As a vector can be represented as a point in a high-dimensional space by its Cartesian coordinates, the averaged MFCC of each clip can be represented as a point in a high-dimensional space. The distance between two clips can then be measured by the distance between the two corresponding points. In our experience, the Euclidean distance gave the most satisfactory clustering result. The Euclidean distance between two points in Euclidean space is the length of the line segment between them. In general, if $p$ and $q$ are two points in $n$-dimensional Euclidean space, the distance between them is

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}. \tag{3}$$
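The similarity calculation can be sketched in a few lines. The example below assumes each clip's MFCC is given as a list of rows (one row per coefficient, one column per frame); the function names are chosen for illustration only.

```python
import math


def mean_mfcc_vector(mfcc_matrix):
    """Average each row over all frames, giving one value per coefficient."""
    return [sum(row) / len(row) for row in mfcc_matrix]


def euclidean_distance(p, q):
    """Length of the line segment between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def distance_matrix(vectors):
    """Symmetric matrix of pairwise Euclidean distances, zero on the diagonal."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = euclidean_distance(vectors[i], vectors[j])
    return dist
```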

Agglomerative Hierarchical Clustering.
Hierarchical clustering is a method of cluster analysis that can discover the structure of a dataset in an unsupervised way. It builds a hierarchy of all the data either by iteratively merging nodes into bigger clusters (agglomerative) or by iteratively splitting clusters in the inverse direction (divisive). Agglomerative hierarchical clustering (AHC) is the most common type of hierarchical clustering [12,13]. Pairs of clusters are successively merged until all clusters have been merged into one big cluster that contains all objects. At each iteration, the two nodes or clusters with the minimum distance are merged. The result is a tree-based representation of all the objects, called a dendrogram. The number of clusters needs to be specified to obtain the final partition.
A 120-minute file (2 hours) was selected from the data and segmented into 60-second clips for demonstration; therefore, 120 clips were used in the experiments. The STFT window length is 1000 ms with an overlap of 500 ms, and 40 Mel-scale filters were used in MFCC extraction. The distance matrix is a symmetric matrix of size 120 × 120. The dendrogram of the clustering result is shown in Figure 4; it is truncated to show the main structure for better visualization. Based on the structure of the dendrogram, the data were divided into 3 clusters. Cluster 1, cluster 2, and cluster 3 are shown in cyan, magenta, and yellow, respectively. The properties of each cluster are listed in Table 1.
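The merging procedure can be sketched as a naive agglomerative loop over a precomputed distance matrix. The text does not state which linkage criterion was used, so average linkage is an assumption here, as are the function name and the flat labeling of the result; a library implementation would be preferable for real data.

```python
def agglomerative_clusters(dist, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain.

    dist is a symmetric matrix of pairwise distances between items.
    Returns one integer cluster label per item.
    """
    clusters = [[i] for i in range(len(dist))]

    def average_linkage(a, b):
        # Mean distance over all cross-cluster item pairs (assumed linkage).
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x stays valid

    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

For the 120-clip demonstration, `dist` would be the 120 × 120 matrix of distances between averaged MFCC vectors and `n_clusters` would be 3.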
One clip was chosen from each cluster as an example for analysis. The waveform and Mel-spectrum of each example are shown in Figure 5. Figure 5(a) is a spectrum of snoring. The snoring sounds are almost equal in amplitude and evenly spaced, the pitch of the snoring sound is in the low-frequency range and corresponds to a fundamental frequency with associated harmonics, and the inspiratory phase is louder than the expiratory phase. Figure 5(b) is a spectrum of normal respiration. It is characterized by a broader spectrum and is audible during both the inspiratory and expiratory phases. Figure 5(c) is a spectrum of the uncertain type. The signal is very weak, and its spectrum has almost equal energy at frequencies below 2000 Hz. It is mixed with weak breathing, but the signal level is insufficient for analysis.

Snoring Classification Based on K-Means Algorithm.
Snoring occurs when the upper airway collapses and air moving around the floppy tissue near the back of the throat causes the tissue to vibrate. Simple snoring (also called benign snoring) occurs when there is a partial collapse of the soft tissues. As such, simple snoring is generally not considered a health threat. Apneic snoring (also called obstructive sleep apnea-related snoring) is caused by partial or complete obstruction of the airway, which partially or completely stops the airflow, so that little or no oxygen reaches the blood [14,15]. In apneic snoring, the closed upper airway suddenly opens at the end of the obstruction, and the pressures of the upper and lower airflows are suddenly balanced, causing the upper airway to open and close repeatedly in a short period and produce a popping sound. The degree of collapse and the resistance of the upper airway may vary greatly from the beginning to the end of inspiration, thus affecting the vibration of the upper airway tissue [16]. The snoring sounds of patients with obstructive sleep apnea and of simple snorers have different characteristics and different effects on breath quality. It is essential to discriminate between these two types of snoring to evaluate their influence on tidal volume. Formant frequencies represent the resonance frequencies of the airways and change with the upper airway anatomy. A formant is a broad spectral maximum produced by an acoustic resonance of the human vocal tract [17]. Formants are a direct source of pronunciation information, and the extraction and trajectory tracking of formants play an important role in speech recognition and speech synthesis. The formants F1-F3 are the three lowest resonant frequencies of the vocal tract. F1 is associated with the degree of pharyngeal constriction and the height of the tongue. F2 reflects the degree of advancement of the tongue relative to its neutral position. F3 is related to the degree of lip rounding. Among F1-F3, F1 carries more information than the others because it is associated with the severity of apnea. Like speech, snoring sounds depend on the shape and physical condition of the upper airway, so the formants of snoring can be extracted as a snoring feature [18]. Ng et al. proposed that apneic snoring has a higher F1 formant frequency than simple snoring and that a threshold value of F1 = 470 Hz can be used to distinguish apneic snoring from simple snoring [19]. Sola Soler et al. suggested that the formant standard deviation of OSA snoring is higher than that of simple snoring [20].
These studies used formant parameters to distinguish simple snoring from apneic snoring, and all emphasized the decisive role of F1. However, some cases may be misjudged by these methods. The reason is that the difference between speech formants and snoring formants is not considered. In speech processing, the most important part of formant analysis is the formant tracks, and the spacing between the formants of words is not taken into consideration. On the contrary, in applications such as speech recognition, the effect of spacing needs to be eliminated. A frequently used method is dynamic time warping (DTW). By locally scaling the speech sequence, DTW eliminates the influence of speech rate and word spacing, so that the shapes of two speech sequences are as consistent as possible and the maximum possible similarity is obtained. In snoring recognition, however, the interval between breaths is an important parameter because it is associated with the duration of airflow reduction, and the intervals of apneic snoring are usually longer and more irregular than those of simple snoring. To solve this problem, this paper extracts the standard deviation of the formant interval together with the standard deviation of the formant frequencies as parameters and uses the K-means algorithm to discriminate between simple snoring and apneic snoring by unsupervised clustering. K-means clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters [21]. K is the number of clusters to be created in the process; here, K is set to 2.
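A minimal sketch of K-means with K = 2 on two-dimensional feature vectors is given below. The deterministic farthest-point initialization is an assumption made so that the example is reproducible; the text does not say how the centroids were initialized or whether the two features were rescaled before clustering.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k=2, max_iter=100):
    """Cluster points (tuples of floats) into k groups.

    Returns (labels, centroids). Initialization picks the first point, then
    repeatedly the point farthest from the chosen centroids (assumption).
    """
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids
```

Each point here would be a pair (standard deviation of F1 frequency, standard deviation of F1 interval) for one snoring clip. Because the two features have different units, the result can depend on their relative scale.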
Linear predictive coding (LPC) analysis is one of the faster and more effective methods for estimating formant frequencies. The system function of the human vocal tract can be uniquely determined by a set of linear prediction coefficients, so the effect of vocal tract modulation can be estimated through LPC analysis, and the formants of snoring can be obtained. The sound signals were windowed with a 20 ms Hamming window with 50% overlap. In each window, a 14th-order LPC analysis was performed, and the LPC parameters were calculated via the Yule-Walker autoregressive method with the Levinson-Durbin recursive procedure. The standard deviation of the F1 frequencies and the standard deviation of the F1 interval are extracted to form a 2-dimensional feature vector. The snoring clustering result is shown in Figure 6. Apneic snoring and simple snoring are marked with red and cyan dots, respectively. After K-means clustering, the snoring cluster obtained from AHC is divided into 2 subclusters: cluster 0 and cluster 1. The properties of all 4 clusters are shown in Table 2. The spectra of example clips chosen from cluster 0 and cluster 1 are shown in Figure 7, in which the formants are displayed as black dots on the spectrum.
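The Levinson-Durbin recursion mentioned above can be sketched as follows. This example only computes the LPC coefficients of one frame from its autocorrelation sequence; locating formants additionally requires finding the roots of the resulting prediction polynomial, which is omitted here. The function names and the sign convention (a[0] = 1, prediction error e[n] = sum of a[k] * x[n-k]) are choices made for this sketch.

```python
def autocorrelation(frame, max_lag):
    """Biased autocorrelation sums r[0..max_lag] of one (windowed) frame."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for the given autocorrelation r.

    Returns (a, err): a[0] == 1.0 followed by `order` LPC coefficients, and
    the final prediction error power. Assumes r[0] > 0 (non-silent frame).
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient of stage i
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err
```

At a 5000 Hz sampling rate, a 20 ms frame holds 100 samples, and a 14th-order analysis would call `levinson_durbin(autocorrelation(frame, 14), 14)` on each Hamming-windowed frame.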

Tidal Volume Level Estimation.
For each cluster, different parameters are extracted, and the corresponding tidal volume level is determined from these parameters. The tidal volume levels are divided into three grades: high, medium, and low.
Cluster 2 contains the normal breathing clips. Although there are fluctuations during normal respiration, the tidal volume level of normal breathing can roughly be set as high.
Cluster 1 contains simple snoring. According to Hoffstein's research, simple snoring does not cause a sustained deterioration of MnO2 (mean nocturnal oxygen saturation) but significantly increases the variability of LoO2 (lowest nocturnal oxygen saturation) [22]. Based on this research, the tidal volume level at the beginning of simple snoring is similar to that of normal respiration, but after a certain duration the fluctuation of nocturnal oxygen saturation increases and ventilation quality deteriorates to a moderate level. Although the exact SpO2 drop time is not clear, according to the research by Gruber, the interval to equilibration of oxygen saturation is within 4.5 minutes [23]. Therefore, the SpO2 drop threshold is set at 4 minutes, meaning that approximately 4 minutes after normal breathing ends and simple snoring starts, the SpO2 drops to a medium level with high probability.
Cluster 0 contains apneic snoring. During apnea, the breathing pause lasts longer than in normal breathing. Based on the research by Ma et al., the severity of nocturnal hypoxemia is proportional to the pause time [24]. To evaluate the severity of hypoxemia, the maximum breathing pause interval (MBPI) is calculated as a parameter. According to the definition of apnea, the threshold for distinguishing the low and medium grades of apneic snoring is set to 10 seconds. The criteria for tidal volume level estimation are listed in Table 3.
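The decision rules described in this section can be sketched as a small function. The thresholds (4 minutes of simple snoring, MBPI of 10 seconds) come from the text; the string labels, the function names, and the treatment of the exact 4-minute boundary are assumptions of this sketch, and Table 3 remains the authoritative statement of the criteria.

```python
SIMPLE_SNORING_THRESHOLD_MIN = 4  # minutes of simple snoring before SpO2 drops
MBPI_THRESHOLD_S = 10             # seconds, from the apnea definition


def tidal_volume_level(clip_type, snoring_minutes=0, mbpi_seconds=0.0):
    """Map one clip to a qualitative tidal volume level."""
    if clip_type == "normal":
        return "high"
    if clip_type == "simple_snoring":
        # snoring_minutes: simple snoring already elapsed before this clip.
        if snoring_minutes < SIMPLE_SNORING_THRESHOLD_MIN:
            return "high"
        return "medium"
    if clip_type == "apneic_snoring":
        return "medium" if mbpi_seconds <= MBPI_THRESHOLD_S else "low"
    return "uncertain"


def label_night(clip_types, mbpi_by_clip):
    """Label consecutive 60-second clips, tracking simple-snoring run length."""
    levels = []
    run = 0  # consecutive simple-snoring clips seen so far, i.e. minutes
    for idx, clip_type in enumerate(clip_types):
        if clip_type == "simple_snoring":
            levels.append(tidal_volume_level(clip_type, snoring_minutes=run))
            run += 1
        else:
            run = 0
            mbpi = mbpi_by_clip.get(idx, 0.0)
            levels.append(tidal_volume_level(clip_type, mbpi_seconds=mbpi))
    return levels
```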

Results and Discussion
The SpO2 is a reading that shows the amount of oxygen available in human blood for delivery to the heart, brain, lungs, and other muscles and organs. The LoO2 (lowest nocturnal oxygen saturation) is the lowest SpO2 value during a certain period and is highly correlated with tidal volume. The LoO2 is divided into 3 levels: greater than 95% is considered high, less than 90% is considered low (hypoxemia), and between 90% and 95% is considered medium (mild hypoxemia). The summarized results are shown in Figure 8. The first row is the clustering result; the x-axis represents the clip index, and each clip is 60 seconds long. Each clip is classified as apneic snoring, simple snoring, breathing, or uncertain. The second row is the tidal volume level calculated by the proposed algorithm. The third row is the LoO2, which is divided into high, medium, and low levels; the uncertain level corresponds to the uncertain clustering type. The fourth row is the SpO2 level that is used to calculate the third row.
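The reference labels can be derived from the SpO2 channel as sketched below. The thresholds follow the text; taking the minimum over non-overlapping 60-second windows of the 1 Hz SpO2 signal, and the function names, are assumptions of this example.

```python
def loo2_per_clip(spo2_samples, clip_seconds=60, sample_rate_hz=1):
    """Lowest SpO2 value (percent) within each full, non-overlapping clip."""
    n = clip_seconds * sample_rate_hz
    return [
        min(spo2_samples[i:i + n])
        for i in range(0, len(spo2_samples) - n + 1, n)
    ]


def loo2_level(loo2_percent):
    """Map a LoO2 value to the high / medium / low grades used in the text."""
    if loo2_percent > 95:
        return "high"
    if loo2_percent < 90:
        return "low"  # hypoxemia
    return "medium"   # between 90% and 95%: mild hypoxemia
```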
Six clips were selected as representatives and are shown in Figure 9. Figure 9(a) is a normal respiration state at the 13th minute; the corresponding SpO2 is stable, and the LoO2 is above 95%. Figure 9(b) is apneic snoring with MBPI ≤ 10 s at the 19th minute; the SpO2 fluctuates, and the LoO2 is between 90% and 95%. Figure 9(c) is apneic snoring with MBPI > 10 s at the 20th minute; the SpO2 fluctuates dramatically, and the LoO2 is below 90%. Figure 9(d) is simple snoring at the 16th minute; the SpO2 is at a high level, as in (a). Figure 9(e) is simple snoring at the 43rd minute; the SpO2 drops slightly, and the LoO2 drops to between 90% and 95%. Figure 9(f) is an uncertain case in which the signal is insufficient to calculate the SpO2 level.
The accuracy is calculated by equation (4). Six patients with different apnea-hypopnea index (AHI) values were selected to test the effectiveness and robustness of the proposed method. The AHI is defined as the number of apneas or hypopneas per hour of sleep and is used as a parameter for evaluating OSA severity. An AHI below 15 is considered mild apnea, an AHI between 15 and 30 denotes moderate apnea, and an AHI greater than 30 is considered severe. The characteristics of the selected data and the algorithm performance are shown in Table 4. The algorithm accuracy is 88.3% in the group with mild apnea. In the moderate apnea group, the accuracy drops slightly to 85.8%. In the severe apnea group, where the sound signal contains ambient noise, the accuracy is still above 83%.
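Equation (4) is not reproduced in this text, so the sketch below assumes the usual definition: accuracy is the fraction of clips whose estimated tidal volume level matches the LoO2-derived level. Excluding clips labeled uncertain from the count is a further assumption, as are the function names. The AHI grouping follows the thresholds given in the paragraph above.

```python
def level_accuracy(predicted, reference):
    """Fraction of comparable clips whose predicted level equals the reference.

    Clips labeled "uncertain" on either side are skipped (assumption).
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    pairs = [
        (p, r)
        for p, r in zip(predicted, reference)
        if p != "uncertain" and r != "uncertain"
    ]
    if not pairs:
        raise ValueError("no comparable clips")
    return sum(p == r for p, r in pairs) / len(pairs)


def ahi_severity(ahi):
    """Group a patient by apnea-hypopnea index (events per hour of sleep)."""
    if ahi < 15:
        return "mild"
    if ahi <= 30:
        return "moderate"
    return "severe"
```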

Conclusion
In this study, a tidal volume level prediction method is proposed based on unsupervised clustering and snoring parameters. The method provides a coarse-grained tidal volume level estimation that does not need any calibration. In addition, it can be used for sleep breathing monitoring in a home environment. However, the accuracy of the method is limited because ambient noise and other noise can cause misjudgment. Breathing during sleep is also affected by many other factors, such as sleep position, pulmonary disease, and body movement, which cannot be captured by breathing sound. We plan to improve the performance by incorporating these factors in future work.

Data Availability
The data supporting the current study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.